Abstract

Background: Genome-wide association studies (GWAS) have identified hundreds of genetic variants associated
with complex human diseases, clinical conditions and traits. Genetic mapping of expression quantitative trait loci
(eQTLs) is revealing novel functional effects of thousands of single nucleotide polymorphisms (SNPs). In a
classical quantitative trait loci (QTL) mapping problem, multiple tests are done to assess whether one trait is
associated with a number of loci. In contrast to QTL studies, thousands of traits are measured along with thousands
of gene expressions in an eQTL study. Such a study requires a huge number of tests (~$10^{10}$).
This extreme multiplicity gives rise to many computational and statistical problems. In this paper we
address these issues using two closely related inferential approaches: an empirical Bayes method that retains a
Bayesian flavor without requiring much a priori knowledge, and the frequentist method of false discovery rates.
A three-component t-mixture model is used for the parametric empirical Bayes (PEB) method. Inferences
are obtained using the Expectation/Conditional Maximization Either (ECME) algorithm. A simulation study is
also performed, and the method is compared with a nonparametric empirical Bayes (NPEB) alternative.

Results: The results show that PEB has an edge over NPEB. The proposed methodology has been applied to
human liver cohort (HLC) data. Our method discovers more significant SNPs at FDR < 10% than
the previous study by Yang et al. (Genome Research, 2010).

Conclusions: In contrast to previously available methods based on p-values, the empirical Bayes method uses the local
false discovery rate (lfdr) as the threshold. This method controls the false positive rate.



Introduction 

Genome-wide association studies (GWASs) have made
remarkable progress in searching for susceptibility genes.
In GWAS, instead of one gene at a time, variation across
the entire genome is tested for association with disease
risk. GWASs exploit the linkage disequilibrium (LD)
relationships among single nucleotide polymorphisms (SNPs),
making it possible to assay the genome by testing a finite
number of SNPs. To date, the signals that can be discovered
through GWAS have not been reported exhaustively.
It is important to annotate SNPs with information on
expression for a better understanding of the genes and
mechanisms driving the association. In many situations,
there are more common variants truly associated with
disease. These variants are highly likely to be expression
quantitative trait loci (eQTLs). eQTLs are derived from
polymorphisms in the genome that result in differential
measurable transcript levels. Microarrays are used to
measure gene expression levels across genetic mapping
populations. For at least a subset of complex disorders, gene
expression levels could be used as a surrogate/biomarker
for classical phenotypes. The gene underlying the eQTL is
considered to be an excellent candidate for the phenotypic
QTL.
eQTL mapping is a statistical technique to locate genomic
intervals that are likely to regulate the expression of
each transcript, by correlating quantitative measurements
of mRNA expression with genetic polymorphisms segregating
in a population. In a GWAS, millions of SNPs are
tested at once. Associations that initially appear to be
significant must be statistically adjusted to account for the
large number of tests being performed. A large number
of false positives will result if this correction is ignored.
The multiple-testing correction, however, sets a very high
threshold for genome-wide significance, on the order of
$5 \times 10^{-8}$ when a million SNPs are tested. In the vast
majority of cases, however, association studies have
achieved only limited success. Large sample sizes are
needed to achieve sufficient statistical power to detect
risk alleles with effects weak enough to have escaped
detection in the past; the disease risk alleles identified by
GWASs so far do have weak effects, each with odds
ratios of 1.1 or 1.2 [1].
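
As a rough check on the genome-wide threshold, assume a Bonferroni-style correction at a family-wise level of $\alpha = 0.05$ over $m = 10^6$ tests. Both the correction and the level are assumptions made here for illustration; the text does not name them.

```latex
\[
\frac{\alpha}{m} = \frac{0.05}{10^{6}} = 5 \times 10^{-8}.
\]
```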

Two closely related inferential procedures for multiple
testing are discussed in this work: a frequentist
approach based on Benjamini and Hochberg's [2] false
discovery rate procedure, and an empirical Bayes methodology
developed in Efron et al. [3,4]. These two methods
are not only very closely related, they can be used to
support each other. For the classic two-sample problem in a
microarray experiment, these approaches have been
discussed by Efron and Tibshirani [5]. However, they
considered a nonparametric empirical Bayes (NPEB) model.
Parametric Bayesian modeling has been considered by
Newton et al. [6], Lee et al. [7], Kendziorski et al. [8-10] and
Gelfond et al. [11]. Hierarchical models such as gamma-gamma
[6] or lognormal-normal [8] are used quite often
in PEB procedures. These models suffer from a serious
drawback: the variation is assumed constant among genes.
They have been extended by allowing
gene-specific variations [12]. Applications of empirical
Bayes have not been very common in the literature.
The obvious reason is that experimenters have not
produced many data sets having the parallel structure
necessary for empirical Bayes to work well. Because of
the recent surge in high-throughput [13] technologies
and genome projects, many genome studies are now
underway. These studies have become a major data
generator in the post-genomics era. Empirical Bayes
procedures seem to be particularly well-suited for combining
information in expression data.

One of the fundamental statistical problems in microarray
gene expression analysis is the need to reduce the
dimensionality of the transcripts. This can be achieved
by identifying differentially expressed (DE) genes under
different conditions or groups. A regulatory network can
be obtained by associating differential expressions with
the genotypes of molecular markers. It is possible to have
a large number of DE genes that influence a certain
phenotype while their relative proportion is very small.
It is very important to identify these DE genes from
among the recorded genes [6,7,9,14,15].
Empirical Bayes methods provide a natural approach to
reduce the dimensionality significantly [16,17]. Following
the empirical Bayes approach, DE genes are identified
using the posterior probability of differential expression.
EB approaches detect a DE gene by sharing information
across the whole genome.

The development of empirical Bayes methodologies
that improve the power to detect DE genes essentially
reduces to the choice of whether gene-specific effects
should be modeled as fixed or random [18]. Both the mean
and the error variance can be either fixed or random.
A fixed mean with random error variance has been
considered by Wright and Simon [19] and Cui et al. [20],
whereas Lonnstedt et al. [21], Tai and Speed [22] and
Lonnstedt and Speed [23] have considered both parameters
to be random. A random mean effect with
homogeneous fixed error variance has been considered
by Newton et al. [6,24] and Kendziorski et al. [9,10].
An extension of this fixed error variance has been
considered by Gelfond et al. [11], who used a discrete
uniform prior for the variance component.

The paper is organized as follows. In the Methods section
we introduce the necessary notation for our additive
genetic model along with the notion of false
discovery rate (fdr). In this section we also establish
the relationship between fdr and empirical Bayes.
The Methods section also describes the proposed
Expectation/Conditional Maximization Either (ECME) algorithm
(Liu and Rubin [25]) in detail. This algorithm generalizes the
Expectation-Maximization algorithm with a better convergence
rate. A simulation study is described in the Results section.
We show that the proposed parametric empirical Bayes
performs better than nonparametric empirical Bayes in terms
of controlled fdr. In the Results section, as an application,
we apply the proposed methodology to the human liver cohort
(HLC) dataset. We conclude the article with the Discussion section.

Methods 

In a microarray experiment, we obtain several thousand
expression values, one or many for each gene. These
studies offer an unprecedented ability to do large-scale
studies of gene expression. Let us define $G_i$ ($i = 1, \ldots, I$) as
the genomic markers (i.e. SNPs), and $T_j$ ($j = 1, \ldots, J$) as the
transcripts. The identified eQTLs refer to the significant
$G$s that are associated with $T$s. These associations can
be found using a test statistic based on all $n$ samples.
The genetic model for this association can be one of
three models: dominant, recessive and additive. Under
the dominant model, we can have two genotype groups for
each of the SNPs. For an additive model, however, three
genotype groups are available. A transcript $T_j$ is assumed
to be associated with marker $G_i$ if the mean level of
expression of transcript $T_j$ for one genotype group is
different from that of the other genotype group
corresponding to that marker. Let $\mu^{(1)}_{T_j, G_i}$ and $\mu^{(2)}_{T_j, G_i}$ be the group
means corresponding to the genotypes of $G_i$. To test the
hypothesis $H_0: \mu^{(1)}_{T_j, G_i} = \mu^{(2)}_{T_j, G_i}$, a few test statistics have been
proposed for microarray data analysis [26]. The present
work is based on the statistic proposed by Efron et al.
[4]. The test statistic is defined as

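$$ Z_{ij} = \frac{\bar{x}^{(1)}_{T_j, G_i} - \bar{x}^{(2)}_{T_j, G_i}}{a_{0i} + S_{ij}} \tag{1} $$

where $\bar{x}^{(1)}_{T_j, G_i}$ and $\bar{x}^{(2)}_{T_j, G_i}$ are the sample means of the two
genotype groups, $S_{ij}$ is the usual standard deviation and $a_{0i}$ is
defined to minimize the difference in the coefficient of
variation of $Z_{ij}$ within classes of genes with approximately
equal variance. A drawback of calculating $a_{0i}$ in this way is
the computational cost. Note that if $a_{0i} = 0$, this reduces
to the usual t-statistic. Here $a_{0i}$ is taken to be the 90th
percentile of all $S_{ij}$ values (Efron et al. [4]).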
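
As a minimal sketch of statistic (1), the Python snippet below computes $Z_{ij}$ for a list of two-group comparisons. It assumes that $S_{ij}$ is the pooled standard error of the difference in group means (the text only calls it the usual standard deviation) and takes $a_{0i}$ as the 90th percentile of all $S_{ij}$ values. The function names are illustrative and are not taken from the original work.

```python
import math
from statistics import fmean


def percentile(values, q):
    """Percentile (q in [0, 100]) by linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def pooled_se(g1, g2):
    """Pooled standard error of the difference in means (assumed form of S_ij)."""
    m1, m2 = fmean(g1), fmean(g2)
    ss = sum((x - m1) ** 2 for x in g1) + sum((x - m2) ** 2 for x in g2)
    pooled_var = ss / (len(g1) + len(g2) - 2)
    return math.sqrt(pooled_var * (1.0 / len(g1) + 1.0 / len(g2)))


def z_statistics(pairs, a0=None, q=90.0):
    """Return (Z values, a0) for a list of (group1, group2) expression samples."""
    diffs = [fmean(g1) - fmean(g2) for g1, g2 in pairs]
    ses = [pooled_se(g1, g2) for g1, g2 in pairs]
    if a0 is None:
        a0 = percentile(ses, q)  # 90th percentile of all S_ij by default
    return [d / (a0 + s) for d, s in zip(diffs, ses)], a0
```

Passing `a0=0.0` recovers the ordinary pooled two-sample t-statistic, which is a convenient sanity check; a positive `a0` shrinks every statistic toward zero, most strongly for pairs with small $S_{ij}$.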

When expression measurements between two groups
are compared for any transcript, the observations are
partitioned into two user-defined groups of sizes $n_1$ and $n_2$
with $n_1 + n_2 = n$. If there is no significant difference
between the group means, the transcript is said to be
equivalently expressed (EE). If, on the contrary, a significant
difference is observed, the transcript is termed
differentially expressed (DE). Any transcript $T_j$, for SNP $G_i$,
is either DE or EE. This uncertainty
can be modeled by a mixture of two distributions as
follows:

$$ f(Z_{ij} \mid \theta) = \pi_0 f_0(Z_{ij} \mid \theta) + \pi_1 f_1(Z_{ij} \mid \theta) \tag{2} $$

where $\pi_0$ is the mixing proportion of EE transcripts,
$\pi_1 = (1 - \pi_0)$ is the proportion of DE transcripts, and $\theta$
is the vector of parameters characterizing the
distributions. Let $F_i$ be the minor allele frequency of the
$i$th SNP; then we model the distribution of $Z_{ij}$ as a
mixture model of the form:

$$ \Pr(Z_{ij} \mid F_i) \propto [f_0(Z_{ij} \mid F_i)]^{1 - \delta_{ij}} [f_1(Z_{ij} \mid F_i)]^{\delta_{ij}} \tag{3} $$

where $f_1(\cdot)$ denotes the distribution of $Z_{ij}$ for nonzero
associations between $G_i$ and $T_j$, and $f_0(\cdot)$ denotes the
distribution of $Z_{ij}$ for zero associations. $\delta_{ij}$ is defined as

$$ \delta_{ij} = \begin{cases} 1 & \text{if a nonzero association is present} \\ 0 & \text{if a zero association is present} \end{cases} $$

For any transcript and any SNP there are three
possible relations: no association, positive association
and negative association. Extending the idea of the
two-component mixture model, the distribution of the test
statistic is modeled by the following mixture model:

$$ f(Z_{ij} \mid \theta_i, F_i) = \sum_{k=0}^{2} \pi_{ik} f_k(Z_{ij}; \mu_{ik}, \tau^2_{ik}, \nu_{ik}) \tag{4} $$

where

$$ \pi_i = (\pi_{i0}, \pi_{i1}), \quad \theta_i = (\mu_{i1}, \mu_{i2}, \tau^2_{i1}, \tau^2_{i2}), \quad \nu_i = (\nu_{i1}, \nu_{i2}) $$

with $\mu_{i0} = 0$, $\tau^2_{i0} = 1$. The mixing proportions $\pi_{ik}$ are
nonnegative constants that sum to one for fixed $i$. $f_0(\cdot)$
corresponds to the distribution for no association, whereas $f_1(\cdot)$
and $f_2(\cdot)$ correspond to the distributions for positive
and negative association, respectively. In a recent work,
Noma and Matsui [27] used a semiparametric hierarchical
mixture model in which the distribution of the mean
expression level of a transcript is taken to be a
three-component mixture distribution.

A full Bayesian analysis of (4) would require prior
specifications of $\pi$, $\theta$, $\nu$, $f_0(Z)$ and $f_1(Z)$. However, one can use the
massively parallel structure of microarray data to obtain
an empirical Bayes estimate of the posterior probability.
Such large data sets motivate a data-based, empirical
investigation rather than the specification of a priori
models [27].

Empirical Bayes, false discovery rates (fdr) and local false discovery rate (lfdr)

The false discovery rate (fdr) is defined as the expected
proportion of errors committed by falsely rejecting null
hypotheses. Benjamini and Hochberg's [2] fdr criterion
is very closely related to empirical Bayes analysis, and
this relation has strengthened the connection between Bayesian
and frequentist testing theory. The close connection
between fdr and the empirical Bayes methodology
follows directly from Bayes' theorem and has been
established by the "Equivalence theorem" [28]. Tail-area
rejection regions like $\{Z_{ij} \le z\}$ are common in the
frequentist framework. According to this theorem, if the
tail-area rejection region is taken to be as large as possible
subject to the constraint that the estimated Bayes
proportion of false discoveries is less than $\alpha$, then the
frequentist expected proportion of false discoveries is
also less than $\alpha$.
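
For reference, the Benjamini and Hochberg [2] step-up rule can be sketched in a few lines of Python. This is the textbook form of the procedure operating on p-values, not the authors' code; the function name and the default level are illustrative assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.10):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the hypotheses ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k (1-based) such that p_(k) <= k * alpha / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank
    # Reject every hypothesis whose rank is at most the cutoff rank.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected
```

The rule rejects all hypotheses up to the largest rank that passes its threshold, even if some smaller ranks fail theirs, which is what makes it a step-up procedure.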

The empirical Bayes approach suggests a local version
of the fdr called the local false discovery rate (lfdr). The
Bayes probability that a transcript $T_j$ for SNP $G_i$ is "EE"
given the test statistic $Z_{ij}$ is known as $\mathrm{lfdr}(Z_{ij})$ and is
defined as

$$ \mathrm{lfdr}(Z_{ij}) \equiv \Pr(T_j \text{ is EE} \mid Z_{ij}) = \pi_{i0} f_0(Z_{ij}) / f(Z_{ij}) $$

Analytically, fdr is a conditional expectation of lfdr,
defined as

$$ \mathrm{fdr}(Z_0) = \int_{-\infty}^{Z_0} \mathrm{lfdr}(Z) f(Z)\, dZ \Big/ \int_{-\infty}^{Z_0} f(Z)\, dZ = E_f[\mathrm{lfdr}(Z) \mid Z \le Z_0] $$

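For the setup in (3), $1 - \hat{\delta}_{ij}$ represents the local
false discovery rate (lfdr), where $\hat{\delta}_{ij}$ is the posterior
probability of a nonzero association, and fdr can be estimated accordingly:

$$ \hat{\delta}_{ij} = \frac{\pi_{i1} f_1(Z_{ij})}{(1 - \pi_{i1}) f_0(Z_{ij}) + \pi_{i1} f_1(Z_{ij})} $$

and hence

$$ \mathrm{lfdr}(Z_{ij}) = 1 - \hat{\delta}_{ij} \tag{5} $$

$$ \mathrm{fdr}(Z) = \frac{(1 - \pi_{i1}) \int_{-\infty}^{Z} f_0(x)\, dx}{(1 - \pi_{i1}) \int_{-\infty}^{Z} f_0(x)\, dx + \pi_{i1} \int_{-\infty}^{Z} f_1(x)\, dx} \tag{6} $$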
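
A minimal Python sketch of (5) and (6) for a single SNP is given below. It assumes, for illustration only, that $f_0$ and $f_1$ are location-scale t densities with known parameters. The parameter values in the usage example, the function names and the numerical integration rule (a midpoint rule after mapping the left tail to the unit interval) are not from the original analysis.

```python
import math


def t_pdf(x, mu=0.0, tau2=1.0, nu=10.0):
    """Density of a t distribution with location mu, squared scale tau2 and nu d.f."""
    z2 = (x - mu) ** 2 / tau2
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(math.pi * nu * tau2))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(z2 / nu))


def left_tail(pdf, z, steps=20000):
    """Integral of pdf over (-inf, z] using x = z - s / (1 - s), s in (0, 1)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h  # midpoint rule never touches s = 1
        total += pdf(z - s / (1.0 - s)) / (1.0 - s) ** 2
    return total * h


def lfdr(z, pi1, f0, f1):
    """Equation (5): posterior probability of no association given Z = z."""
    null = (1.0 - pi1) * f0(z)
    return null / (null + pi1 * f1(z))


def tail_fdr(z, pi1, f0, f1):
    """Equation (6): Bayes false discovery rate of the region {Z <= z}."""
    null = (1.0 - pi1) * left_tail(f0, z)
    return null / (null + pi1 * left_tail(f1, z))


if __name__ == "__main__":
    # Hypothetical null and alternative densities; the alternative is shifted left.
    f0 = lambda x: t_pdf(x, 0.0, 1.0, 10.0)
    f1 = lambda x: t_pdf(x, -3.0, 1.0, 10.0)
    print(lfdr(-4.0, 0.1, f0, f1), tail_fdr(-4.0, 0.1, f0, f1))
```

Because (6) averages the lfdr over the whole rejection region, the tail-area value at a cutoff generally differs from the lfdr evaluated exactly at that cutoff.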



$\nu_{i0}$ is estimated by the permutation method (Efron et al.
[4]) and $\pi_{i0}$ is estimated from the nonnegativity constraint

$$ \pi_{i0} \le \min_{Z} \frac{f(Z)}{f_0(Z)} $$



All other parameters are estimated by the EM algorithm,
assuming $f_0(\cdot)$ to be known. There are some practical
difficulties with the lfdr, which relies on densities. The
estimation of the null becomes more problematic in the far
tails. It is relatively easier to work with the cumulative
distribution function than with densities. Identification
of discoveries by lfdr may not be reproducible for new
data. Therefore, even in the empirical Bayes framework, fdr
should be preferred.

Nonparametric empirical Bayes (NPEB)

The main difference between parametric empirical Bayes
(PEB) and nonparametric empirical Bayes (NPEB) is the
way in which $f_1(\cdot)$ and $f_2(\cdot)$ are treated. In the PEB model,
the functional forms of $f_1(\cdot)$ and $f_2(\cdot)$ are known, i.e., we
have a parametric family of priors. In contrast, NPEB
does not assume the functional form to be known.
Though NPEB methods are quite powerful, they are
more suitable for large-sample analyses. To compute the
fdr under the NPEB setup, we have followed the algorithm
proposed by Efron et al. [4].

ECME algorithm

To fit a mixture model, the EM algorithm is widely used. For
the t distribution, the mean parameter and variance
component can easily be estimated by the EM algorithm
assuming that the degrees of freedom $\nu$ are known. When
$\nu$ is unknown, EM can still be used, as demonstrated
by Lange, Little and Taylor [29]. But this method appears
to be very slow (Liu and Rubin [30]), and an extension
has been proposed by Meng and Rubin [31] as the ECM
algorithm. This is a generalization of the EM algorithm
where the E-step remains the same but the M-step is
replaced by CM (constrained or conditional maximization)
steps. The ECM algorithm is basically a generalized EM
(GEM), as shown by Meng and Rubin [31]. Incidentally,
the rate of convergence, in terms of iterations, for the
ECM algorithm is slower than that of EM. To overcome
this computational problem, Liu and Rubin [30] proposed
an efficient algorithm, ECME, which is in turn an extension
of the ECM algorithm. Though this is not a GEM, it
converges faster.
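For the $i$-th SNP, the complete data are defined as

$$ D_{ic} = (Z_{ij}, \delta_{ijk1}, \delta_{ijk2}, \ldots, \delta_{ijkn}, U_{i1}, U_{i2}, \ldots, U_{in}) $$

where

$$ \delta_{ijks} = \begin{cases} 1 & \text{if the } s\text{th observation of } Z_{ij} \in k\text{th component} \\ 0 & \text{otherwise} \end{cases} $$

and the $U_i$s are independently distributed gamma
variables.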

McLachlan and Krishnan [32] have already discussed the
application of the EM algorithm for ML estimation in the case
of a single-component t distribution. In the ECME algorithm,
this result is extended to cover the present setup
of a 3-component mixture of t distributions. For the sake
of brevity, in this section we omit the suffix $ij$ for all the
variables. To define the t distribution with mean $\mu_k$, variance
parameter $\tau_k^2$ and degrees of freedom $\nu_k$, we proceed as follows:

If $Z \mid U = u, \delta_k = 1 \sim N(\mu_k, \tau_k^2 / u)$ and $U \sim \Gamma(\nu_k/2, \nu_k/2)$

then marginally, $Z \sim t(\mu_k, \tau_k^2, \nu_k)$.
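
The scale-mixture representation above is easy to check by simulation. The sketch below draws $U$ from a gamma distribution with shape $\nu/2$ and rate $\nu/2$ and then $Z$ from the conditional normal. The rate parameterization of the gamma is an assumption (it is the usual one for this construction), and the function names and parameter values are illustrative.

```python
import random


def draw_t(mu, tau2, nu, rng):
    """One draw from t(mu, tau2, nu) via Z | U = u ~ N(mu, tau2 / u)."""
    # random.gammavariate takes (shape, scale); a rate of nu / 2 is a scale of 2 / nu.
    u = rng.gammavariate(nu / 2.0, 2.0 / nu)
    return rng.gauss(mu, (tau2 / u) ** 0.5)


def sample_moments(mu, tau2, nu, n, seed=0):
    """Sample mean and variance of n draws from the scale mixture."""
    rng = random.Random(seed)
    draws = [draw_t(mu, tau2, nu, rng) for _ in range(n)]
    mean = sum(draws) / n
    var = sum((d - mean) ** 2 for d in draws) / (n - 1)
    return mean, var
```

For $\nu > 2$ the variance of the resulting t distribution is $\tau^2 \nu / (\nu - 2)$, so the sample variance should be noticeably larger than $\tau^2$ when $\nu$ is small.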

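Following the above definition, the complete-data likelihood
$L_c$ can be factorized as a product of three terms:
the marginal densities of the $\delta$s, the conditional densities of $U \mid \delta$,
and the conditional densities of $Z \mid U = u, \delta$. In notation, the
log-likelihood of the complete data can be expressed as

$$ \log L_c(\psi) = \log L_{1c}(\pi) + \log L_{2c}(\nu) + \log L_{3c}(\theta) \tag{7} $$

where

$$ \log L_{1c}(\pi) = \sum_{k=0}^{2} \sum_{s=1}^{n} \delta_{ks} \log \pi_k \tag{8} $$

$$ \log L_{2c}(\nu) = \sum_{k=0}^{2} \sum_{s=1}^{n} \delta_{ks} \left\{ -\log \Gamma\left(\tfrac{\nu_k}{2}\right) + \tfrac{\nu_k}{2} \log\left(\tfrac{\nu_k}{2}\right) + \tfrac{\nu_k}{2} (\log u_s - u_s) - \log u_s \right\} \tag{9} $$

and

$$ \log L_{3c}(\theta) = \sum_{k=0}^{2} \sum_{s=1}^{n} \delta_{ks} \left\{ -\tfrac{1}{2} \log(2\pi) - \tfrac{1}{2} \log \tau_k^2 + \tfrac{1}{2} \log u_s - \frac{u_s (Z_s - \mu_k)^2}{2 \tau_k^2} \right\} \tag{10} $$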

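E-Step

To compute the E-step of the proposed algorithm, at the $(t+1)$th
step we need to calculate $Q(\psi; \psi^{(t)})$, the current conditional
expectation of the complete-data log-likelihood
function $\log L_c(\psi)$. From equations (7) to (10), we can write

$$ Q(\psi; \psi^{(t)}) = Q_1(\pi; \psi^{(t)}) + Q_2(\nu; \psi^{(t)}) + Q_3(\theta; \psi^{(t)}) \tag{11} $$

where

$$ Q_1(\pi; \psi^{(t)}) = \sum_{k=0}^{2} \sum_{s=1}^{n} \hat{\delta}_{ks}^{(t)} \log \pi_k \tag{12} $$

and

$$ \hat{\delta}_{ks}^{(t)} = \frac{\pi_k^{(t)} f_k(Z_s; \mu_k^{(t)}, \tau_k^{2(t)}, \nu_k^{(t)})}{f(Z_s; \psi^{(t)})} \tag{13} $$

which is the posterior probability that $Z_s$ belongs to the
$k$-th component of the mixture based on the current fit $\psi^{(t)}$.
Similarly,

$$ Q_2(\nu; \psi^{(t)}) = \sum_{k=0}^{2} \sum_{s=1}^{n} \hat{\delta}_{ks}^{(t)} \left\{ -\log \Gamma\left(\tfrac{\nu_k}{2}\right) + \tfrac{\nu_k}{2} \log\left(\tfrac{\nu_k}{2}\right) + \tfrac{\nu_k}{2} \left( e_{ks}^{(t)} - u_{ks}^{(t)} \right) - e_{ks}^{(t)} \right\} \tag{14} $$

where

$$ u_{ks}^{(t)} = \frac{\nu_k^{(t)} + 1}{\nu_k^{(t)} + (Z_s - \mu_k^{(t)})^2 / \tau_k^{2(t)}}, \qquad e_{ks}^{(t)} = \log u_{ks}^{(t)} + \psi_0\left(\frac{\nu_k^{(t)} + 1}{2}\right) - \log\left(\frac{\nu_k^{(t)} + 1}{2}\right) \tag{15} $$

are the conditional expectations of $U_s$ and $\log U_s$ given $Z_s$ and
$\delta_{ks} = 1$, $\psi_0(\cdot)$ is the digamma function, and

$$ Q_3(\theta; \psi^{(t)}) = \sum_{k=0}^{2} \sum_{s=1}^{n} \hat{\delta}_{ks}^{(t)} \left\{ -\tfrac{1}{2} \log(2\pi) - \tfrac{1}{2} \log \tau_k^2 + \tfrac{1}{2} e_{ks}^{(t)} - \frac{u_{ks}^{(t)} (Z_s - \mu_k)^2}{2 \tau_k^2} \right\} \tag{16} $$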

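CM-step

In the usual M-step, the parameters $\pi$, $\nu$, $\theta$ can be estimated by
maximizing $Q_1$, $Q_2$ and $Q_3$ in (12), (14) and (16) independently. The
new updates for $\pi$ and $\theta$ are available in closed form,
whereas for $\nu$ an iterative procedure may be
used, based on the following equations, in which $\hat{\delta}_{ks}^{(t)}$ is the
posterior membership probability and $u_{ks}^{(t)}$ the conditional expectation
of $U_s$ from the E-step, and $\psi_0(\cdot)$ is the digamma function:

$$ \pi_k^{(t+1)} = \frac{1}{n} \sum_{s=1}^{n} \hat{\delta}_{ks}^{(t)} \tag{17} $$

$$ \mu_k^{(t+1)} = \frac{\sum_{s=1}^{n} \hat{\delta}_{ks}^{(t)} u_{ks}^{(t)} Z_s}{\sum_{s=1}^{n} \hat{\delta}_{ks}^{(t)} u_{ks}^{(t)}}, \quad k = 1, 2 \tag{18} $$

$$ \tau_k^{2(t+1)} = \frac{\sum_{s=1}^{n} \hat{\delta}_{ks}^{(t)} u_{ks}^{(t)} (Z_s - \mu_k^{(t+1)})^2}{\sum_{s=1}^{n} \hat{\delta}_{ks}^{(t)}}, \quad k = 1, 2 \tag{19} $$

and $\nu_k^{(t+1)}$ is the solution of the following equation

$$ -\psi_0\left(\frac{\nu_k}{2}\right) + \log\left(\frac{\nu_k}{2}\right) + 1 + \frac{1}{n_k^{(t)}} \sum_{s=1}^{n} \hat{\delta}_{ks}^{(t)} \left( \log u_{ks}^{(t)} - u_{ks}^{(t)} \right) + \psi_0\left(\frac{\nu_k^{(t)} + 1}{2}\right) - \log\left(\frac{\nu_k^{(t)} + 1}{2}\right) = 0, \qquad n_k^{(t)} = \sum_{s=1}^{n} \hat{\delta}_{ks}^{(t)} \tag{20} $$

To get an efficient algorithm, let us partition $\psi$ as
$(\psi_1', \psi_2')'$, where $\psi_1$ contains all the parameters except
those corresponding to the degrees of freedom of the
t distributions. The above M-step is replaced by two CM-steps,
as follows.

CM-Step 1. Keeping $\psi_2$ fixed at $\psi_2^{(t)}$, i.e. $\nu$ fixed at $\nu^{(t)}$,
maximize $Q(\psi; \psi^{(t)})$ to get $\psi_1^{(t+1)}$.

CM-Step 2. Now fix $\psi_1$ at $\psi_1^{(t+1)}$ and calculate $\psi_2^{(t+1)}$
by maximizing $Q(\psi; \psi^{(t)})$.

Furthermore, to make the algorithm more efficient,
after the first CM-step we carry out the E-step with
$\psi = (\psi_1^{(t+1)\prime}, \psi_2^{(t)\prime})'$ instead of $\psi = (\psi_1^{(t)\prime}, \psi_2^{(t)\prime})'$.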
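
One iteration of the scheme described above can be sketched as follows for a three-component t mixture whose null component is held fixed at $\mu_0 = 0$, $\tau_0^2 = 1$ and a given $\nu_0$. This is an illustrative reading, not the authors' R implementation. CM-step 1 uses the closed-form updates for the mixing proportions, locations and scales, while CM-step 2 follows the "Either" idea of ECME and picks each $\nu_k$ by maximizing the observed-data log-likelihood over a small candidate grid instead of solving a digamma equation. The grid, the function names and the starting values are assumptions.

```python
import math


def t_logpdf(x, mu, tau2, nu):
    """Log density of a t distribution with location mu, squared scale tau2, nu d.f."""
    z2 = (x - mu) ** 2 / tau2
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(math.pi * nu * tau2)
            - (nu + 1) / 2 * math.log1p(z2 / nu))


def log_likelihood(z, pi, mu, tau2, nu):
    """Observed-data log-likelihood of the three-component mixture."""
    return sum(
        math.log(sum(pi[k] * math.exp(t_logpdf(x, mu[k], tau2[k], nu[k]))
                     for k in range(3)))
        for x in z
    )


def e_step(z, pi, mu, tau2, nu):
    """Posterior memberships delta[s][k] and gamma weights u[s][k]."""
    delta, u = [], []
    for x in z:
        dens = [pi[k] * math.exp(t_logpdf(x, mu[k], tau2[k], nu[k])) for k in range(3)]
        total = sum(dens)
        delta.append([d / total for d in dens])
        u.append([(nu[k] + 1) / (nu[k] + (x - mu[k]) ** 2 / tau2[k]) for k in range(3)])
    return delta, u


def ecme_iteration(z, pi, mu, tau2, nu, nu_grid=(2.0, 4.0, 8.0, 16.0, 32.0)):
    """One E-step followed by CM-step 1 and a likelihood-based CM-step 2."""
    n = len(z)
    delta, u = e_step(z, pi, mu, tau2, nu)
    # CM-step 1: nu fixed; component 0 (null) keeps mu = 0 and tau2 = 1.
    pi = [sum(row[k] for row in delta) / n for k in range(3)]
    mu, tau2, nu = list(mu), list(tau2), list(nu)
    for k in (1, 2):
        w = [delta[s][k] * u[s][k] for s in range(n)]
        mu[k] = sum(ws * x for ws, x in zip(w, z)) / sum(w)
        tau2[k] = (sum(ws * (x - mu[k]) ** 2 for ws, x in zip(w, z))
                   / sum(row[k] for row in delta))
    # CM-step 2 ("Either"): maximize the actual log-likelihood over nu_k,
    # keeping the current value as a candidate so the likelihood cannot decrease.
    for k in (1, 2):
        best_nu, best_ll = nu[k], log_likelihood(z, pi, mu, tau2, nu)
        for cand in nu_grid:
            trial = nu[:k] + [cand] + nu[k + 1:]
            ll = log_likelihood(z, pi, mu, tau2, trial)
            if ll > best_ll:
                best_nu, best_ll = cand, ll
        nu[k] = best_nu
    return pi, mu, tau2, nu
```

Calling `ecme_iteration` repeatedly and tracking `log_likelihood` gives a simple convergence monitor; a finer grid or a one-dimensional root finder could replace the coarse search over `nu_grid`.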

Results 

Simulation study 

To assess the proposed methodology, a small-sample
simulation study has been performed. It indicates
whether or not the parameters are well estimated and,
most importantly, provides information on false
discovery rates.

First we simulated a dominant model with 10,000
transcripts and 10 SNPs. The equivalently expressed
(EE) transcripts are generated from N(0,1) after
log-transformation. We simulated the data under three
choices of the proportion of differentially expressed (DE)
transcripts ($p_1$), namely 0.01, 0.05 and 0.10.
A DE transcript is generated from
N(4,0.5) after log-transformation. The controlled fdr levels are
likewise taken to be 0.01, 0.05 and 0.10 for these data sets.
For $p_1 = 0.05$, the simulated data are shown in Figure 1.
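
The simulation design can be sketched as below. The code reads the second argument of N(4, 0.5) as a standard deviation, which is an assumption, and generates values for a single SNP; `simulate_statistics` and `realized_fdr` are illustrative names. `realized_fdr` computes the proportion of false calls among all calls, which is the quantity that a "true FDR" is compared against the controlled level.

```python
import random


def simulate_statistics(n_transcripts=10000, p1=0.05, seed=0):
    """Simulate values for one SNP; returns (values, is_de flags)."""
    rng = random.Random(seed)
    values, is_de = [], []
    for _ in range(n_transcripts):
        de = rng.random() < p1
        # Assumption: 0.5 in N(4, 0.5) is a standard deviation, not a variance.
        values.append(rng.gauss(4.0, 0.5) if de else rng.gauss(0.0, 1.0))
        is_de.append(de)
    return values, is_de


def realized_fdr(called, is_de):
    """Fraction of called transcripts that are truly EE (0 if nothing is called)."""
    n_called = sum(1 for c in called if c)
    if n_called == 0:
        return 0.0
    false_calls = sum(1 for c, d in zip(called, is_de) if c and not d)
    return false_calls / n_called
```

A typical use is to threshold the simulated values with some rule, pass the resulting calls to `realized_fdr`, and compare the result with the nominal level.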

The impact of minor allele frequency (MAF) on the
null distribution has also been studied. Under the
null, for a t-distribution, the only parameter to be
estimated is its degrees of freedom. The comparison has
been made by computing different quantiles for six
choices of MAF. The lower quantiles almost
overlap with each other. Very small deviations are
observed for the upper quantiles (Figure 2).

For the 10 SNPs, we fitted the null distribution using the
permutation method in a balanced way. From each
group, 35 randomly selected samples are shifted from
one group to the other and the value of the statistic is
noted. This process is repeated 40 times and histograms
are plotted. From the histograms, the degrees of freedom
of the null distribution for each
SNP are estimated. To get an idea about the goodness-of-fit,
Q-Q plots are drawn (Figure 3). These plots show
that the null distribution is well approximated by the
standardized t-distribution with appropriate degrees of
freedom.
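
A balanced permutation of this kind can be sketched as follows: a fixed number of samples is swapped between the two genotype groups, the statistic is recomputed, and the procedure is repeated. The helper names, the generic `statistic` callback and the seed are illustrative assumptions; the text uses 35 swapped samples and 40 repetitions.

```python
import random


def balanced_permutation(group1, group2, n_swap, rng):
    """Swap n_swap randomly chosen samples between the two groups."""
    idx1 = rng.sample(range(len(group1)), n_swap)
    idx2 = rng.sample(range(len(group2)), n_swap)
    new1, new2 = list(group1), list(group2)
    for i, j in zip(idx1, idx2):
        new1[i], new2[j] = group2[j], group1[i]
    return new1, new2


def permutation_null(group1, group2, statistic, n_swap, n_rep, seed=0):
    """Values of statistic(group1, group2) over n_rep balanced permutations."""
    rng = random.Random(seed)
    return [statistic(*balanced_permutation(group1, group2, n_swap, rng))
            for _ in range(n_rep)]
```

For example, `permutation_null(g1, g2, lambda a, b: sum(a) / len(a) - sum(b) / len(b), 35, 40)` returns 40 null values of a difference in means, provided each group holds at least 35 samples.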

Parameters of the mixture model (4) are
estimated using the proposed ECME algorithm after estimating
the null distribution using the permutation method. Then
FDR is computed under both the proposed parametric
empirical Bayes and the nonparametric empirical Bayes setups,
and the results are given in Table 1.

It is evident from Table 1 that nonparametric
empirical Bayes is much more conservative than
its parametric alternative. For the parametric setup, the
true FDR is very close to the controlled one,
whereas for nonparametric empirical Bayes these values
are not so close as the true fraction of DE transcripts
increases.

Figure 1 A part of the simulated data for $p_1 = 0.05$.

HLC data analysis 

We applied the empirical Bayes model to analyze a
publicly available sequencing data set. In the current
study, we started with liver tissue data of 213
Caucasian samples from a previously described human
liver cohort (HLC) (Yang et al. [33]). To get the
genotypes and gene expression profiles, DNA and RNA
were isolated. The Illumina platform was used to obtain the
expressions. After filtering (MAF>5%,
HWE<10 ^) we are left with 173 samples, 472,000
SNPs and 30,000 expressions.




Figure 2 Effect of minor allele frequency (MAF) on the null
distribution. Only upper quantiles (from 80%) are
considered, as the lower quantiles show almost no difference.



The distribution of minor allele frequency (MAF) over
SNPs is given in the histogram (Figure 4). For all
possible SNP-transcript combinations, the test statistics $Z_{ij}$ are
computed. We fit the mixture model using the ECME
algorithm in R 2.15.1 after estimating the null
distribution using the permutation method. However, because
the data are high-dimensional, it becomes very difficult to fit a mixture
model using the proposed algorithm. For the sake of
parsimony, we further filtered the data, and the ECME
algorithm is used only for the top SNPs with p-value < 10^^.
For these top SNPs, the mixture model is fitted and
estimates are obtained. These estimates are used to compute
lfdr and FDR from (5) and (6), respectively.

Conclusion 

To compare our result with [33], we focus on 18 of the 54
P450 genes used in that study. These are CYP3A5, CYP2D6,
CYP4F12, CYP2E1, CYP2U1, CYP1B1, CYP2C18,
CYP4F11, CYP4V2, CYP2F1, CYP39A1, CYP26C1,
CYP2C19, CYP2C9, CYP2S1, CYP46A1, CYP4A11 and
CYP4X1. Our method fails to identify a single SNP
with FDR < 10% for CYP2R1, and that gene symbol has therefore been
excluded from Table 2. It can be seen from
Table 2 that, for a threshold of 10% FDR, the number of
significant eQTL pairs is 4916. Since we have considered only
the top SNPs, this may be an overestimate. SNPs
within 1 Mb of the gene location are defined as
cis-SNPs. It is interesting to note that, among these 18
genes, the first five (CYP3A5, CYP2D6, CYP4F12, CYP2E1
and CYP2U1) have more than 40 cis-SNPs. The FDR-based
analysis identifies more cis-SNPs for these 18 genes than
Yang et al. (2010) [33] in all cases except CYP3A5
(52 versus 55 in Table 2).

Table 1 The True FDR Performance of Controlled FDR in EB Models

Figure 4 Minor allele frequency (MAF) distribution. X axis
corresponds to minor allele frequency 25% to 50%.

Table 2 Number of eQTL pairs after crossing the threshold of FDR

Gene symbol    No. of SNPs (FDR<10%)    No. of cis-SNPs    No. of cis-eSNPs (FDR<10%) by Yang et al. (2010)
CYP3A5         253                      52                 55
CYP2D6         254                      57                 54
CYP4F12        392                      55                 45
CYP2E1         130                      45                 31
CYP2U1         549                      45                 26
CYP1B1         158                      21                 13
CYP2C18        90                       13                 9
CYP4F11        159                      15                 7
CYP4V2         159                      25                 3
CYP2F1         324                      10                 2
CYP39A1        448                      17                 2
CYP26C1        154                      29                 1
CYP2C19        355                      7                  1
CYP2C9         413                      20                 1
CYP2S1         319                      10                 1
CYP46A1        430                      7                  1
CYP4A11        451                      4                  1
CYP4X1         151                      3                  1

Discussion

In contrast to previously available methods based on
p-values, the empirical Bayes method uses the local false
discovery rate (lfdr) as the threshold. This method controls
the false positive rate. For a particular SNP, the lfdr is
computed from the site-specific evidence, whereas the FDR
averages over other sites with stronger evidence. There
are some limitations of using FDR which may result in
misleading inferences in genome studies. In such a
situation, it is better to use lfdr, although it is somewhat more
difficult to estimate than FDR. However, there is still one
computational problem which needs much attention. Due
to the high dimensionality of the data, existing
algorithms sometimes fail. This calls for
more efficient algorithms. The choice of the FDR threshold
is an important deciding factor in such studies. It
would be interesting to see how the number of cis-SNPs varies
with the FDR threshold. In this way, the FDR
criterion can be used to estimate the number of SNPs that we may
need to consider.